Mathematics  Research  Center 
University  of  Wisconsin— Madison 
610  Walnut  Street 
Madison,  Wisconsin  53706 


>- 

q  September  1 979 

iuJ 

■ — 1  (Received  July  27,  1979) 


/ 

/ 


VT'l  •** 

l^3 


\  • 


y 


s 

1 


l§ 


K 

\ 


Approved  for  public  release 
Distribution  unlimited 


Sponsored  by 

U.S.  Army  Research  Office 
P.0.  Box  12211 
Research  Traingle  Park 
North  Carolina  277C9 


15 


UNIVERSITY  OF  WISCONSIN-MADISON 
MATHEMATICS  RESEARCH  CENTER 


FURTHER  STUDY  OF  ROBUSTIFICATION 
VIA  A  BAYESIAN  APPROACH 

Gina  Chen  and  George  E.  P.  Box 

Technical  Summary  Report  #  1998 
September  1979 

ABSTRACT 

The  Bayesian  outlier  procedure  discussed  by  Box  and  Tiao  (1968)  which  uses 
the  contaminated  normal  model  is  further  explored  in  this  report.  For  a  simple 
location  estimate  suggested  by  their  method,  the  weight  given  each  observation 
is  expressed  explicitly  in  terms  of  standardized  residuals  so  allowing  compari¬ 
son  with  M-estimators. 


AMS  (MOS)  Subject  Classification:  62G35 

Key  Words:  Contaminated  normal  model,  Bayesian  procedure,  M-estimator, 
weighting  function,  standardized  residuals 

Work  Unit  Number  4:  Probability,  Statistics,  and  Combinatorics 


I  ,  I  ha*  bM«  approved 

,  tor  public  »!«■*  rwd  «1«;  jfe 

'  is  united. 


Sponsored  by  the  United  States  Army  under  Contract  No.  DAAG29-75-C-0024 . 


FURTHER  STUDY  OF  RO BU ST  I F I CAT  I ON 
VIA  A  BAYESIAN  APPROACH 

Gina  Chen  and  George  E.  P.  Box 

1.  Introduction 

Statistics  is  a  science  which  helps  extract  meaningful  infor¬ 
mation  from  data  which  are  subject  to  errors.  Two  important  aspects 
of  statistical  analysis  are 

(i)  The  building  of  probabilistic  models,  defining  the  functional 
form  of  a  relationship  between  variables  as  well  as  the 
distribution  of  the  errors. 

(ii)  The  development  of  techniques  of  analysis  which  are  efficient 
supposing  the  data  to  be  derived  from  the  specified  model. 
Model  building  begins  with  preliminary  and  graphic  analysis  of  the 
data.  This  information-  is  combined  with  theoretical  knowledge  about 
the  system  to  produce  a  tentative  model.  One  desirable  criterion 
for  choosing  a  model  is  simplicity.  Such  a  model  is  easy  to  inter¬ 
pret  and  can  give  greater  precision  than  less  parsimonious  models 
(Box,  1978).  An  iterative  model  building  procedure  is  needed  which 
can  move  from  a  perhaps  initially  faulty  model  to  one  which  is 
adequate  for  the  purpose  at  hand. 

Two  techniques  are  of  value  in  this  process. 

(D  Robustification:  By  employing  robust  methods  (or  models) 
we  are  unlikely  to  be  misled  by  those  model  inadequacies 
guarded  against. 
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(ii)  Iterative  fixing:  Conditional  inference  can  be  made 

assuming  that  the  tentative  model  is  true,  and  one  can  then 
make  diagnostic  checks  to  reveal  possible  inadequacies. 

If  there  are  indications  of  inadequacy,  the  model  can  be 
modified  appropriately. 

Both  techniques  have  their  advantages  ar.d  limitations. 

It  is  impossible  to  make  a  procedure  robust  against  every 
possible  contingency.  On  the  other  hand,  situations  can  arise 
where  tests  may  fail  to  show  up  inadequacies  that  could, 
nevertheless,  cause  serious  problems.  Looking  for  inadequacies 
after  the  event  may  not  always  be  successful. 

A  practical  policy  is  to  robustify  for  one  or  more 
contingencies  that  seem  likely  in  the  context  of  a  particular 
application  and  then  to  study  the  resiauals  to  seek  for  the 
unexpected. 

The  Standard  Statistical  Models 

Consider  the  standard  statistical  model 

yi  =  n(x^,0)  +  e..  i  =  l,...,n  (  1.1) 

where  c^'s  arc  independently,  identically  and  normally  distri¬ 
buted.  In  this  model,  it  is  assumed  that 
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(i)  The  observation  y^  is  a  function  of  dependent  variables 
x.  and  parameter  0  plus  an  error  e.. 

(ii)  The  errors  e^e^,...^  are  independently  distributed. 

( i i i )  All  the  errors  e^'s  have  the  same  normal  distribution 
with  a  fixed  variance. 

Violations  of  any  of  these  assumptions  can  cause  problems. 
However,  this  thesis  is  concerned  only  with  departures  from  the 
third  assumption. 

It  is  well  known  that  real  data  can  be  affected  with  dis¬ 
cordant  values  produced  by  a  mistake  or  a  sudden  change  of  some 
unknown  factor.  It  is  usually,  therefore,  unrealistic  to  assume 
that  every  observation  is  from  the  same  normal  distribution.  A 
more  sensible  possibility  suggested  by  Dixon  (1953)  and  by  Tukey 
(1960)  is  to  allow  errors  to  be  from  two  different  distributions. 
The  standard  distribution  f(e)  occurs  with  probability  1-a 
and  an  alternative  distribution  g(e)  generates  discordant 
values  with  probability  a.  We  will  call  this  the  contaminated 
model.  From,  for  example,  the  study  made  in  Chapter  3  of  this 
thesis,  it  is  clear  that  this  is  not  the  only  possibility.  As 
we  saw  there,  a  model  which  could  be  realistic  in  some  cases  would 
have  errors  generated  from  one  distribution  up  to  a  certain  point 
and  then  randomly  switching  to  a  different  distribution  having 
either  a  different  mean  or  a  different  variance  or  both  for  a 
certain  period  of  time  and  then  switching  again.  A  model  which 
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has  this  property  for  switches  in  means  v/as  proposed  by  George 
Barnard  (1959)  in  his  justification  of  the  cumulative  sum  chart. 

We  will  only  consider  the  contaminated  model. 

2.  A  Bayesian  Analysis  of  the  Contaminated  Model 

In  those  situations  where  a  contaminated  distribution  is 
appropriate,  a  point  estimate  may  be  obtained  by  taking  the  mean 
of  the  posterior  distribution  of  the  appropriate  parameter  and 
Bayesian  intervals  can  be  calculated.  The  posterior  mean  takes  the 
form  of  a  weighted  average  in  which  the  weights  are  posterior  prob¬ 
abilities.  The  use  of  the  Bayesian  posterior  mean  as  an  estimator 
may  be  justified  on  the  grounds  that  it  minimizes  the  mean  square 
error  loss. 

The  formulation  of  Box  and  Tiao  (1968)  can  be  readily 
generalized  to  include  nonlinear  models  as  follows. 

Consider  a  model  Y  =  n(X,6)  +  e  where  Y  is  an  n  *  1 
vector  of  observations,  X  is  an  n  *  p  matrix  of  fixed  elements 
with  rank  p,  0  is  a  q  x  1  vector  of  parameters,  n(X,0) 

Is  an  n  x  l  vector  for  each  given  X  and  0,  and  e  is  an 
n  x  1  vector  of  random  errors.  Suppose  now  that  each  of  the 
errors  independently  comes  from  either  one  of  the  two  distributions 
—  a  standard  distribution  f(e|£j)  and  an  alternative  distribution 

g(e]^)»  where  and  ^  are  nuisance  parameters  and  0  is 

the  parameter  of  interest. 
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Let  be  the  event  that  a  particular  set  of  r  of  the 

n  e's  are  from  g(e|£2)  and  the  remaining  s  =  n-r  from 
f(e|5j).  Corresponding  to  a^,  vector  c  was  partitioned 

Into  e(r)>  G(s)"’  vector  Y  into  Y(r}’  Y(s)  and  matrix  x  int0 
X(f),  X^sj.  Let  P'  '  be  the  prior  probability  of  event  a(r)» 
PCQ.C-j.^)  be  the  prior  distribution  of  the  parameters.  The  pos¬ 
terior  distribution  is  then  given  by 


P(0|Y)  =  J  P(a(r) |Y)P(0|a(r) ,Y) 
where  P(e|a^j,Y)  = 

/P(9tgrg2)f(Y(s)-n(X(s)>B)/g1)g(Y(r)-n(X(r),e)/g2)dg1dg2 

/P(e,51,C2)f(Ytsj-n(X(sj,0)/?1)g(Y(r)-n(XCr)f0)/52)dc1d52d0 


where  f  denotes  the  product  of  densities  of  the  elements  of 
Y(s)-n(X(s) ,0)  and  g  that  of  Y(r)_ri(x(r) »0)-  The  denominator 

is  the  probability  of  the  event  that  the  are  drawn  from 


g(e|S2)  and  the  c ^  from  f (e | C-j ) •  If  we  denote  it  by 

f) 

(r) 


h(Y(r)~g’Y(s)~f)  then 


P(a(r)|V)= 


(S/^Vr^u)- 


h(X(rfg;Y(s)"'f) 


f) 


e  p(r)htYfrr^Y(si-f> 

D(o). 


'h(Y(rrfsY(srf) 


»(r) 


h(YJsff)h(Y 


P^h(V,  ,-f)h(V, 


IrX. 


P(r)h(Y(rr9|V(srO 


-f) 


91*, .1-0 


is  from 


where  h(Y^  ~  f)  is  the  marginal  probability  that 
f(e|£j),  h(Y^  -  g] Y^  -  f)  is  the  conditional  probability  that 
is  from  gUI^)  given  that  e^  is  from  f(e|£^). 

3_.  The  Contaminated  Normal  Distribution 

Since  all  experimental  data  are  subject  to  unpredictable 
mistakes,  a  probability  model  which  would  be  more  realistic  than 
the  standard  normal  model  would  often  be  one  where  with  probability 
1-cx,  the  error  is  normally  distributed  with  mean  0  and  variance 
o2  and  with  small  probability  a  the  error  has  a  mean  0  but  an 
inflated  variance  k2a2. 

One  argument  in  favor  of  this  model  is  as  follows.  The 
trimmed  mean  has  been  recommended  as  a  sensible  estimator  for 
location  parameters  by  Tukey  and  McLaughlin  (1963),  Huber  (1972) 
and  Stigler  (1977).  More  recently  M-estimators  have  been 
recommended  by  Huber,  Hampel  and  Andrews  (Andrews  et  al  (1972)). 
From  the  study  in  TSR#  1  997  we  have  seen  that  the  contaminated 
normal  distributions  would  produce  via  the  Bayesian  route  estima¬ 
tors  which  are  very  similar  to  those  recommended  by  these  authors. 

Furthermore,  results  of  TSR*  2002  suggest  that  heavy¬ 
tailed  distributions  are  sometimes  caused  by  inhomcgenoity  in  mean 
and  variance  such  as  might  be  encountered  early  in  an  experiment 
because  of  start-up  difficulties.  After  such  inhomogeneity  has 
been  allowed  for,  the  observations  seem  to  be  adequately  repre- 


-6- 


scnted  by  a  contaminated  normal  distribution.  Thus, for  a  carefully 
planned  experiment,  a  contaminated  normal  distribution  'is  likely 
to  be  appropriate. 


4.  Bayesian  Inference  with  Contaminated  Normal  Model 

Consider  that  the  error  distribution  is  a  contaminated 
normal  distribution  (l-a)N(0,o2)  +  aN(0,k2a2).  This  is  the 


2  —  2 


special  case  where  f (ej e  2o  ,  g(ek-)  =  — - — e  2k  0 

/2tto  ^TTkO 

if-!  f* 

and  Pv'  •  =  a'  (1-ct)"  ' .  Assume, a  priori  that  the  information 
about  0  and  about  a  are  independent  and, as  in  Box  and  Tiao 
(1968)  approximate  the  locally  noninformative  prior  by 

P(9,o)  «  P( 9 )  •  ^  .  Then 

P(0,SrS2)  =  P(0,a)  «  p(e)  1 

and  if  we  let  =  P(a^|Y)  and  p(r)(el Y)  =  p(0la(r)»Y)  we 


have 

with 


P(0!Y)  .  (Z)  w(r)P(r)(e|Y, 
P(r)(e|Y)«  /V(,'+1)P(e)exp[.-lrs(r)(8j 


do 


n 

T 


«  P(e)[s(r)(o)] 

where  S^rj(G)  =  (Yjg)-n(X^)  ,0)) 1  (Y(s)-n(X(s)  »6)) 


+  7T  »Yri^"ri(>:^^0))'(Y(r}-n(X(r),0)) 


k2  '  (r)  u  (r) 
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W/N  =  C 


plr^Y(rP-Ym-t) 

Pl0)MY(r)-fiV(s)-f) 


c(^r-) 

l-a 


fa~(n+1)P(e)k~re'  2S(r)(°)a  deda 

,  -fn+n  -if(Y-n(x>e))l(Y-n(x,e))]o-‘2 
/a  'P(0)e  2  dGdo 


a  /,  -r 


c  (^-)  k' 
1-cr 


/P(e)[s(r)(e)]  2  de 
/P(0)[(Y-n(x,e)) ' (Y-n(x,9))3~  n/2de 


where  c  is  some  constant. 

When  n  is  approximately  linear  in  0  over  the  range 
considered,  it  would  also  be  reasonable  to  take  P(0)  as  approxi¬ 
mately  constant.  In  such  case  the  above  formula  becomes 


P(r) (e I Y)  «  {S(r)(0)>  2 

K(.  .  Wk-r  — 

[  '  1_a  /{(Y-n(x,e))’(Y-n(x,0))}'  n/2d0 

A  robust  point  estimate  for  0  is  then  given  by  the  posterior 


/  6P ( 0 | Y ) dO  =  l  W/  \/0Pf  \ (0 | Y)d9  . 

(r)  \  \  i 

Furthermore,  probabilities  can  be  calculated  which  are  useful 
in  spotting  possible  bad  values.  (Observations  generated  by  g 
are  referred  to  as  bad  values.)  If  we  let  be  the  posterior 

probability  of  event  a^  given  the  sample  Y: 


be  the  event  that  the  observation  y^  is  from  g  and  all 
the  others  are  from  f. 

a.,  be  the  event  that  the  observations  y.  and  y.  are 

1 J  *  J 

from  g  and  all  the  others  are  from  f. 

Then  w.,  w..  are  the  posterior  probabilities  associated 
•  J 

with  a.  and  a... 

I  I J 

Using  these  individual  probabilities,  one  can  calculate 

q.,  the  posterior  probability  that  only  one  observation 

n 

is  from  g  is  £  w. 

i=l  1 

q£,  the  posterior  probability  that  only  two  observations 

n 

are  from  g  is  7  w.. 

.  i »j'l  1J 
i<j 

and  similarly  q^,  q^,...,etc. 

Also,  the  posterior  conditional  probabilities  can  be  cal¬ 
culated  as  follows: 

^i/1  =  observation  is  from  g/  there  is  only 

one  bad  value)  =  w^/q^ 

^ij/2  =  anc*  observations  are  from  g/there 

are  only  two  bad  values)  =  w../q? 


etc. 
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The  di strit.jt.ion  g  is  regarded  as  the  source  of  "Lad  values," 
and  these  probabilities  can  be  interpreted  as  the  posterior  prob¬ 
abilities  that  there  is  one  bad  value,  two  bad  values,. .. .etc. ,  and 
the  conditional  probabilities  as  the  posterior  probabilities  that  the 
i^th,  i^th,...,  i^th  observations  are  bad  values  given  that  there 

are  S,  bad  values. 

Although  the  posterior  probabilities  q/s  depend  on  the 
choices  of  a  and  k,  the  conditional  probabilities  are  independent 
of  a  and  rather  insensitive  to  k.  It  seems  that,  in  practice  by 
comparing  these  conditional  probabilities  one  can  readily  identify 
questionable  observations. 


5.  Some  Further  Study  of  the  Linear  Model 

When  n(X,e)  =  X0,  We  have  the  general  linear  mode1  and 
the  posterior  distribution  of  0  can  be  written  as  in  Box  and 
Tiao  in  the  following  way 


p(e|v>  =  (i)  V)V>(e|v) 


with 


’(r) 


1/2  r  2 

=  c(f-)Yr — -  j-lrl] 

"a  Ix'x-^yx.  J1/2  [  s2  J 


(  5.1) 


(r)  (r ) 


’(r)(°lY) 


-  r^)lx,x-»x(r),x(r) 


1/2 


r(^v)(mvs^rj) 


h 


„  |lt  (e-o,r))(»-«-W[r)'«(rl)(i>-o(r,) 

’S(r)  l  (  5.2) 


i 
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where 


V)  ■  (X(n-r)X(n-r)+^’<(,)X(r)>'ltX(n-r)V(n.r)+^(r)V(r)> 


f  5.3) 


’(r) 


=  vS(r)(0(r))  = 


-r)'X(n-r)6(rp 


v  jjX(n-i 

x  (Y(n-r)-X(n-r)6(r)) 

+  ^7(Y(r)'X(r)6(r)),(Y(r)-X(r)e(r))] 


(  5.4) 


v  =  n-p 


and 


*(o)  * 


The  posterior  distribution  P^(e|Y)  is  a  p-dimensional 

A 

multivariate  t-distribution  with  mean  dispersion  matrix 

s^r)(X'X-^X^r)X(r))~^  and  v  =  n-p  degrees  of  freedom.  It  is  then 
easy  to  see  that  the  robust  point  estimate  -  the  posterior  mean  -  is 

A 

equal  to  J  w(r)0(r)-  menti°nec*  before,  the  posterior  mean 
is  justified  because  it  minimizes  the  mean  square  error  loss. 


The  Weighting  Structure  of  this  Bayesian  Posterior  Mean 

A  most  important  feature  of  a  robust  estimator  is  the  weighting 
pattern  it  puts  on  the  observations.  As  expected,  for  an  estimator 
to  be  insensitive  to  outlying  observations,  the  weights  given  to 
extreme  observations  must  be  small.  Different  criteria  can  be  used. 

In  [.--estimators,  the  weight  given  to  an  observation  depends  on  the 
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percentage  point  it  takes  in  the  ordered  sample,  while  in  M- 
estimators,  the  weights  are  determined  by  the  relative  distances  of 
the  observations.  It  is,  therefore,  informative  to  study  the 
weighting  structure  of  the  Bayesian  posterior  mean.  Two  things  of 
special  interest  would  be  the  variables  through  which  the  weight 
for  each  observation  is  determined  and  the  roles  a  and  k  play 


In  this  estimator. 

A 

As  stated  previously,  the  posterior  mean  is  1  w(r)9(r) 
with  formula  for  and  8^  given  by  (5.1)  and  (5.3). 

Box  and  Tiao  (1963)  give  the  following  two  equations 

vs(r)  ‘  «I-*<V(r)-X(r);>'(I-*X(r)'X'X>'1x(r)!'1<V(r)-X(r)?> 

<  5.5) 


e(r)  ■  0-<{)(x,x)‘1xjr){i-4>x(r)(x,x)"1xjr)}'1(Y(r)-x(r)e). 


(5.6) 


Using  equation  ( 5.6) 


V)‘X(r)V)  ’  Y(r)-X(r)S+«X(r)'X'X>'1x(r)<I-*X(r)<X'X>'lx(r)>'1 

*  (Y(r)‘X(r)6) 
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Since 


U+«l>x(r)(x,x)'1x|r){i^x(r)(x,x)'1x|r)}'1} 

•  {I-<f>X(r)(X'X)‘1X|r)} 

■  i-4>x(r)(x,x)'1xjr)+w(r)(x,x)"1x;r) 


it  follows  that 


{I+<^>X(r)(X,X)"1XJr){I-<^>X(r)(X•X)‘,XJr)}■,}  =  {I-4»X(r)(X1X)',Xjr)} 


•YA"^Y'  - 


'  I V  ' 


Equation  (5.7)  becomes 


Y(r)-X(r)V)  =  {WXr^U'xr'XJ^r'tY/^-X^.e) 


(r)J  V,(r)-A(r)' 


50  Y(r)-X(r)S  '.{I-*X(r)(X'X)'lx(r)}'Y(r)-X(r)V))-  <  5'8> 

Dividing  both  sides  of  equation  (5.5)  by  vs|r)  and  using 
equation  (5.8),  we  have 


4-'  -4-<x(,r*^1e),n-+x,M(x'x)-1xjr)}-1(v(r)-x(r)S) 


^  <r)  (r) 


,+  J-<V(r)-X(r)V)>,tI-*x(r)tX'X)'lx(r)1, 

(r) 


•{HX(r)(x,x)'1xjr)r1{wx(r)(x,xr,x;r))(Y(r)-x(r)e(r)) 


I  V  < 


u  ♦[VrViV) 


vl  Si 


{I-4>X(r)(X*X)'1XJr)}  Y(^V)e(r) 


(5.9) 
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Hote  that  6^  and  s|r^  are  the  estimates  obtained  by 
downweighting  the  observations  which  under  the  assumption  that  a^j 
occurs,  come  from  a  distribution  with  the  larger  variance.  Also 


Y(rrx(r)9(r} 

f/„\  =  -A  /.  \  /  ^  /.  is  the  residual  vector  for  the  observations 


(r) 


(r) 


which  are  downweighted  under  such  an  assumption. 
From  {  5.1),  we  can  write 

1 

ix-xl2 


w(r)  =  C<T^ 


c(A)Vr 


lX'X^X(r)X(r)|1/2  /$2 


(r 


r 


i 

i  v  1 2 


.  C(  «  >V - 1X31 

I ~U  I v i v  *  a v i 


(1+  ^  r! 


1/2  v  1  (r) 


lX'X-*X(r)X(r) 

•  (I-W(r)(K'X)-’xir)}r(r)}? 


tv 


If  we  let  E  =  *(I-*X(r)(X'X)*1x;r))^  ,  then 

ix'xr’iEi  *  ix,xr1ii-w(r)(x,x)‘1x(r)i  (^$)r 


jj 

X 

X 

X 

-5 

= 

I 

W(r) 

IW(r)  1  1 

X(r> 

X'X 

(  5.10) 


*> 


*  |I||X'X-X|r)<.X(r)|(^)r  =  |X,X-$X{r)X(r)|(^-  <>)r. 
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y.-x  .0 

If  we  write  =  l/w( — ^ — )  ,  then  (6.1)  can  be  written  as 


=  0 


or  equivalently 


X'V^Y  -  X'V^Xe  =  0 


where 


The  solution  §  is  then  given  by  (X'V-1X)-1X‘V-1Y-  This  is  the  satr.e 


solution  as  the  weighted  least  square  estimates  in  a  linear  model 

o2  and  the  weighting  function  w 

corresponds  to  the  weights  v,.-^  used  in  the  weighted  least  square 

method.  In  order  to  compare  the  weighting  pattern  for  the  Bayesian 
posterior  mean  with  the  w  function  in  an  M-estimator,  one  needs  to 
write  the  posterior  mean  approximately  in  the  form  of 

(X'V  ]X)  ^ X ' V  ^  Y  and  the  matrix  V  will  then  provide  the  informa¬ 
tion  about  the  weighting  structure.  It  is  not  always  possible  to 
obtain  such  a  matrix  V  for  the  general  linear  model,  and  even  when  pos¬ 
sible,  its  form  is  often  complicated.  However,  in  the  special  case 


when  Var(Y)  =  Vo 


2  _ 


V1  0 


0  v ' 


-16- 


of  a  location  parameter,  simple  results  can  be  obtained  and  much 
insight  gained. 


The  Location  Case 


Consider  the  model  y^  =  8  +  e.. 

Ej  i.i.d.  from  (l-a)tl(0,o2)+  aN(0,k2c2)  ,  i  =  l ....  ,n. 
Then  X'X-n.  X{r)X(r)  -  1 .  I-<S>X(r } (X •  X)"1  X Jr ,  =  I-  J  X(r)Xjr) 
and  following  (4.5.11)  w^rj  reduces  to 


1 

2 


k'r^>  {,+  *  r(r)(I-  5  X(r)X(r))r(r)> 

Also  from  (5.6) 


?{n-l ) 


(  6.2) 


®(r)  ~  ~  n-r<|>  ^(r)"^ 

where  y  is  the  sample  average  and  y^  is  the  average  of 

elements  of  Y, 

In 

The  posterior  mean  is  then 


JL  1 

k  {1+  iTT  r(r)(1‘  nX(r)X(r))r(r)}^ 


X  V~y))  ‘ 


(6.3) 


We  shall  now  discuss  this  formula  in  the  following  cases: 

(D  when  there  is  at  most  one  discordant  observation  in  the  sample, 
ot)  when  there  tire  at  most  two  discordant  observations  in  the 
sample. 
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Only  One  Discrepant  O.  rva tirn  P  on  t 


In  this  circumstance,  the  associated  with  two  or  more 

observations  coming  from  the  distribution  with  the  larger  variance 
will  be  negligible.  The  terms  left  in  (6.3)  would  then  be 
associated  with  no  bad  values  and  with  one  bad  value. 


Ctj,c(T^)k'1(^n+  &  ri°-  n>r/"'V 


1 


n  i  2  .  .  i{n-l )  , 

'  et,!1e,iS»,k‘ ^  (1+^nr’1)  6-4) 


where  c  is  a  constant  such  that  the  weights  would  sum  up  to 
unity. 


Let  R. 


(l+-LHr?)2  and 

v  n-l  n  l 


R  -  I 

i=l  1 


then  c  *  T,  c(l?S>k'1(imF)2,ti 

or  equivalently 

■  1 


which  implies 


=  1 


C  ■  VO  -tV'V/r) 


Substituting  c,  and  R,  (6.4)  becomes 
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).  c 
i=l 


—2—  y. 
n-^>  J-\ 


r  i  i 

I  c(^-)k'\~)2  -L  Ry-'cCr^Jk'^—)7^  $R.y. 
^•j  1-a'  vn-<p 7  n-<*>  l  1-a'  vn-4>  n-<j>  ri 


JL  _L 

1-a  'n-4>'  n-<? 


1+  —  R 

1-a  vn-4>' 


r-  XO^I 


We,  therefore,  have 


posterior  mean  - 


i  -prk*1  (~r)  R  n  R-<frR. 

I _ -  ,  1-a  vn-<j>' _  r  i 

i y  r  i-s  t^tr 

„  In2  «  1  n  2  1 

&  <H?*>  R  1+T^  <^>R 

(  6.5) 


Let  Y 


n  R-<f>R . 

0  -y- ordinary  *eai.,  Y,  ’  ^  y. 


’+>V'0  R 


the  posterior  mean  =  QYq  +  (l-Q)Y^ 


It  is  clear  now  that  the  posterior  mean  can  be  written  as  a 
convex  combination  of  two  location  estimates,  Yn  —  which  is  the 


conditional  posterior  mean  given  that  there  is  no  outlier 


and 


Yj  —  which  is  the  conditional  posterior  mean  given  there  is  exactly 

one  outlier.  The  quantity  Q  is  the  posterior  probability  that 
there  are  no  outliers,  and  1-Q  is  the  posterior  probability  that 
there  is  exactly  one  outlier.  The  crucial  quantities  here  are  R.'s, 

which  are  the  inverse  of  the  t-ordinates  of  r.'s.  If  a  particular 

observation  y^  is  abberant,  the  corresponding  residual  r..  will 

be  large.  By  taking  as  the  inverse  of  the  corresponding 

t-ordinate,  contrasts  with  other  R-'s  will  be  exaggerated.  Thus 

J 

MR. 

will  be  made  relatively  small.  The  quantity  will  then 
put  little  weight  on  y^.  Also  if  there  is  a  discrepant  observa¬ 
tion,  R  will  be  large  which  will  make  Q  small  and  the  poster¬ 
ior  mean  close  to  Y-j.  Conversely,  if  there  are  no  discrepant 

observations,  the  posterior  mean  will  lie  close  to  Y. 

The  value  of  Q  depends  not  only  on  the  sample  through  R, 
but  also  on  values  of  ct  ar.d  k.  If  one  assumes  a  larger  prior 
probability  a  of  outliers,  then  the  posterior  probability  of  no 
outlier  could  be  smaller.  Notice  too  that  the  posterior  mean 
depends  on  a  only  through  Q  so  that  the  role  that  a  plays 
is  only  to  adjust  the  overall  weight  associated  with  fixed  numbers 
of  outliers. 

When  k  is  fairly  large,  $  -  1  -  —  would  be  close  to 

k2 

1  (<}>  =  .96  when  k  =  5).  From  the  formula  of  R^  and  (  6.5),  it 
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seems  that  would  be  rather  insensitive  to  k  since  it  involves 

only  and  so  is  R.  If  we  let  6  =  k"^  and  approximate  0 

by  1,  then  the  posterior  mean  involves  only  one  constant  G.  One 
interpretation  associated  with  G  is  the  following. 

Assume  y^  is  an  observation  from  (l-a)N(0,o2)+aN(O,k2a2 ) 

then  G  is  the  ratio  of  the  posterior  probability  that  y^  is 

from  N(9,k2o2)  to  the  posterior  probability  that  y^  is  from 

N(0,a2)  given  that  y^  =  0. 

P(y1~N(e,k2o2)/y1=9)  P(y1-N(9,k2o2),y1=e) 

P(y1-N(6,a2)/y1=0)  P(y1~N(9,a2),yl=0) 

P(y|=0/y1~N(0,k2a2))P(y^N(0,k2a2)) 

P(y1=e/y1‘'N(0,o2))P(y1~N(0,a2)) 


SZHko 

-  (1-a) 

m To 

G  can  also  be  viewed  as 


«  1  -  r 
1-a  k  “  G  ' 


expected  value  of  /information  from  N(0,k2o2) 
expected  value  of  /information  from  N(0,o2) 
probability  of  an  observation  being  from  N(0,k2a2) 


_ »  /information  in  N(0.k2o2) 

probability  of  an  observation  being  from  U(0,o2 


•  /information  in  N(0,o2) 
a  /j/k2c^  _  n  1_ 

1-a  /T/o2  ^  k 
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So  for  k  large,  we  need  to  determine  only  one  parameter  in 


the  model ,  namely  G. 


The  Weighting  Pattern 

In  the  case  of  a  single  location  parameter, 


(X,V‘1X)‘1X'V'1Y  = 


Thus  the  weighting  function 


used  in  obtaining  the  weighted  least  square 


estimate  is  proportional  to  the  weight  given  to  each  observation. 
It  is,  therefore,  possible  to  write  the  posterior  mean  in  the  form 


of  (X‘ V“7X)”7X' V”7 Y  and  compare  V*1  with  the  weighting  patterns 
used  in  the  M-estimators. 

If  we  assume  at  most  one  outlier,  the  posterior  mean  is 


QY0  +  (1-Q)V1 
,!,("*  (,-q> 


as  shown  above.  It  can  also  be  written  as 


R-<t>R.  \ 

•  Therefore’ 

l  o  R”4’R-j 

V1  “n+  (,-q) 


The  right-hand  side  is  a  function  of  r^;  we  can  thus  obtain  the 

weighting  pattern  of  this  Bayesian  posterior  mean  in  terms  of 
the  standardized  residuals.  This  weighting  pattern  is  sample  de- 
pendent.  V.'o  illustrate  with  a  random  sample  of  size  10  from  N(0,1). 
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Ten  observations  generated  from  a  computer  random  sampling 
routine  and  the  corresponding  residuals  r/s  and  weights 

(vT^)'s  with  a  =  .05  and  k  =  5.0  are  listed  in  Table  1. 

It  is  seen  that  the  weight  given  to  each  observation  is  approxi¬ 
mately  constant  and  close  to  .1.  The  weights  change  when  a 
discrepant  observation  is  present.  Vie  have  added  in  turn  1,  2,  3, 
4,  5,  to  the  seventh  observation  and  again  calculated  the  r..  and 

vT^;  the  results  are  also  shown  in  Table  1. 

If  we  plot  v'1  versus  r^,  with  0,  1,  2,  3,  4,  5  added 

to  we  found  that  they  lie  on  a  smooth  curve  and  such  a  curve 

is  comparable  with  the  weighting  function  of  the  M-estimators  since 
they  both  determine  the  weights  given  to  observations  as  a  function 
of  properly  standardized  residuals.  Figure  i  gives  plots  of 

vj1  versus  r^  for  a  =  .01,  .05  and  .10  and  k  =  5.0.  In  terms 

Of  u,  they  are  weighting  curves  for  G  =  .002,  .011  and  .022. 

The  weighting  curves  for  M-estimators  are  shown  in  Figure  2. 
Examination  of  Figure  1  shows  that  the  weights  are  roughly  equal 
for  residuals  between  zero  and  one  and  start  to  descend  as  they  go 
farther  and  farther  away  from  the  center.  The  larger  a  is  (or 
the  larger  G  is)  the  sooner  and  faster  the  weights  decrease. 
Comparing  with  Figure  2  this  weighting  pattern  has  the  desired 
property  (Pcaton  arc]  Tukey,  1974)  that  it  is  almost  as  constant  in 
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1.0  2.0  3.0  4.05.0  6.0  1.0  2.0  3.0  4.0  5.0  6.0 

r * 


a  =  .10 
,t  G  =  *022 


)a  la 


Figure  1  Weighting  curves  for  different  a(G) 
values  with  k  =  5.0. 
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1.11 
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.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

.10 

add 

2 

to  7- 

■th  observation 

y7  = 

2.52 
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-.55 

-.12 

-.17 

-1.75 

-1.55 

.88 

2.78 

-.52 

.81 

.56 

.10 

.10 

.10 

.10 

.10 

.10 

.09 

.10 

.10 

.10 

add 

3 

to  7- 

•th  observation 

y7~~ 

3.52 

ri 

-.54 

-.18 

-.22  ' 

-1.50 

-1.35 

.64 

3.87 

-.52 

.58 

.38 

.10 

.10 

.10 

.10 

.10 

.10 

.07 

.10 

.10 

.10 

add 

4 

to  7- 

th  observation 

y7  = 

4.52 

ri 

■ 

-.52 

-.22 

-.26 

-1.31 

-1.19 

.46 

4.90 

-.50 

.42 

.25 

.11 

.11 

.11 

.11 

.11 

.11 

.04 

.11 

.11 

.11 

add 

5 

to  7- 

th  observation 

y7  = 

5.52 

ri 

« 

-.50 

-.25 

-.28 

-1.16 

-1.06 

.34 

5.85 

-.49 

.30 

.15 

v^1 

.11 

.11 

.11 

.11 

.11 

.11 

.02 

.11 

.11 

.11 

Table 

1 

The  correspondences 

among 

*1* 

r..  and  vT^ 

with  a  =  .05  ,  k  =  5.0 
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the  center  (normal -1 i he)  as  Huber's  and  Hampel's  and  reduces  to 
small  values  as  smoothly  as  Andrews'  and  Tukey's.  Smoothness 
may  be  another  desired  property  since  a  sudden  change  in  the 
slope  is  rather  artificial  and  one  does  not  have  reason  to  expect 
such  a  change. 

Figure  1  depends  on  a  particular  sample.  Hov/ever,  in  all, 
totally  five  samples  were  drawn.  All  of  the  weighting  patterns  are 
similar  and  those  in  Figure  1  are  typical. 

As  discussed  before,  when  k  is  large,  G  seems  to  be  the 
only  relevant  parameter  to  be  determined.  Figure  1  has  shown 
the  weighting  patterns  for  G  =  .002,  .011  and  .022  where 
systematic  changes  have  been  observed.  With  this  same  sample,  one 
can  also  fix  G  and  calculate  the  weights  for  different  a  and 
k  values.  Figure  3  exhibits  five  weighting  patterns  all  with 
G  =  .011  and  the  following  pairs  of  a  and  k  values. 

a  .050  .095  .136  .174  .208 

k  5.0  10.0  15.0  20.0  25.0 

All  five  curves  look  very  much  alike  with  the  greatest 
discrepancy  being  that  for  the  values  (a,k)  =  (.05,  5.0).  This 
confirms  our  conjecture  that  G  is  the  only  critical  parameter 
when  k  •>  5. 

At  Host  Two  Discordant  Observations  Present 

As  we  have  seen,  the  terms  in  (6.3)  corresponding  to 
no  Had  values  and  one  bad  value  can  be  written  os  cy  and 
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1 

,2  n  R-<J,R. 

C  k~  (j“)  R  l  ^'n~  yi  •  The  term  corresponding  to  two  ' 


bad  values  is 


a  \2,-2/  n 


L‘<T^>VVw>  «♦  n?T  ty1- £  W 


U 

i<j 


ifn-D 


x  (y  - 


^{Vy)) 


a  >*,-2,  n 


i,3=i 

1<j 


r 

nil 

.1 

1 1+  r!  . 

j  n-1  ij 

n 

n 

l  .£ 

mi 

a 


£(n-l) 


n  n 


(6.6) 


where  r . .  = 

I J 


y,-+y,- 


r!j 


rj. 

1j 


ri  _  yi~9ij 
’  ij  sij 


r . .  = 

1J 


y  .-8. . 

s  •  • 
1J 


-  11 

y^j  =  — »  an^  6ij  anc*  sij  are  estimates  obtained  by 

weighted  least  squares  in  which  the  i-th  and  j-th  observations  are 
downweighted. 


If  we  let  T..  -{  1+  -^r  r! . 

ij  )  n-1  ij 


ml  .1 

n  n  |  r.. 
.1  nil)  u 

n  n 


1 

JJ 


^(n-1 ) 


and 


’  ■  ■ 

1|*j 


^(n-1 ) 


(  6.7) 
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(G.7)  reduces  to 


y  C(-°L)  T  /_n_  -  _  _g_0_ *  1  / 

^j-1  C  1-cr  'n-2y>  ij^n-2^  y  n-2<j>  2  ' 

i<j 


2<J>  yi+yj 


=  r  r  ,  a  \ 2 1,-2  /  n  _1_  (/  „  ^  ,  a  .*  -2,  n  , 

l-cx  '  ^ 1 1 - 2 n-2<^  yi  .  '  ‘ ^ n-2<j> ^ 

•  ~  *  I  9  J  “  I 


1 


1<J 


x  — ^Wr  T .  •  (y  -+y  •) 
n-2$  ijVJi  yj' 


It  is  easy  to  see  that  T..  =  T..  so 

*  J  J  * 


I  Tii^yi+yi^  =  I  Tiiyi  +  l  Tiiyi 
i,j=l  1J  1  J  i,j=l  1J  1  i,j=l  J1  J 

i<j 


i<J 


i<J 


n  ■  n  n  n 

=  y  t.  .y.  +  y  t.  -y.  =  y  y  t.  . 
■  1.3-1  ,J1  i  j-l  1J  1  1=10=1^ 

i<j  i>j  j/i 

The  above  formula  can  now  be  written  as 


1 


yi  * 


1 


c  y  k“2/_n _ \  t  _ 1 _ v  _c  y  t  a  _ P._^  _ 

js|  1-a  '  n-2(^^  n-2^  -  i  1-cr  '  ^n-2$'  rw!<i> 


:  l 


^  2  /)  n 

a  y~/(  - ---)  y 

CM-ct;  K  ln-2«;  n-2^> 


2  ')  ..2  n 


a  c/..o_)V2/  J\\  T  v  on  ■ 

K  V2,<,)  T  “TiP<i?JT  yi  * 


T-jMyi 


n 


T-«.LTij 

j=i  J 

,i7i  ' 


i-1 
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ii  r— 1Z3 


Thus  the  posterior  mean  will  be 


T?5  +  c(*)V‘(«3*)  Ts 


1 

2  n  R-$R. 


a  \2, -2/  n 


1 

2 


We  have 

1  1 


which  implies 


c  =  1/ 


1 


t?s  k' WR  +  T 


a  N2, -2,  n 


1 

7 


If  now 


Yo  =  y 


n  (R-  R.) 

Y1  =  '  i^1  (n-$)R  yi 


|T^1T« 
jli. 


Y2  =  iJ1  (n-2*)T  yi 


we  can  write  the  posterior  mean  as 

1 


,+ R  +  (t^)V£(^?)  T 


1 

7 


T~  Y0 


a  \2i -2/  n 
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1 


the  posterior  mean  is  then  a  weighted  average  of  three  estimators, 

Q0Y0  +  +  Q^Y 2  »  where  Y..  is  the  conditional  posterior  mean 

given  that  there  are  exactly  i  bad  values  and  Q.  is  the 

posterior  probability  that  there  are  i  bad  values. 

Generalization  to  the  case  of  t-bad  values  is  not  difficult 
but  is  tedious.  In  general,  however, 

O)  The  posterior  mean  for  t-outliers  is  a  weighted  average 

QqY  +...+(1^  of  i+1  estimators  where  is  the 

conditional  posterior  mean  given  that  there  are  exactly 
1  bad  values  and  is  the  posterior  probability  of  i 
bad  values. 
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(2)  If  k  is  largo  then  <J>  can  be  taken  to  be  1  and  G  is  the 
only  parameter  to  be  determined  in  advance  in  the  model. 

(3)  Each  Yj  itself  is  a  weighted  average  with  the  weight  for  the 


j-th  observation  heavily  dependent  on  the  series  of  quantity 


(  6.8) 


for  all  possible  combinations  of  j'i  .jg,.  •  •  • 

In  this  expression,  r . .  .  is  the  residual  vector  for 

J  J  1  •  •  •  J  .j  _*| 

observations  y.,y.  ,...,y.  calculated  from  the  estimate 
J  J]  J i _! 

where  these  observations  are  down  weighted.  If  the  j-th 

observation  itself  is  a  bad  value, then  (6.8)  will  always 

be  large  since  r].  ■  will  be  large.  When  {y., 

jJl  •  •  •  J  ^  _  -j  J 

y*  ....»y<  )  are  exactly  the  i  bad  values  in  the  sample, 

J1  Ji-1 

the  above  quantity  reaches  its  maximum.  This  will  result 
In  a  small  weight  for  the  j-th  observation.  If  the  j-th 
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observation  is  not  a  bad  value  then  no  matter  how  we  choose 


the  expression  (6.8)  will  never  reach  its 

maximum  and  so  the  weight  given  to  the  j-th  observation 
will  not  be  very  small . 

(4)  Equation  (  C.8)  is  a  direct  extension  of  (  6.7).  From  the 
form  of  (6.7),  it  is  seen  that, if  the  absolute  values  of 

rj,  and  r].  were  fixed,  the  value  of  (6.7)  would  be 

I J  I J 

larger  when  they  were  both  positive  or  both  negative  and 
smaller  when  they  had  opposite  signs.  This  implies  that 
if  there  are  two  bad  values  in  a  sample,  these  bad  values 
will  have  less  weight  if  they  fall  on  different  sides  of 
the  mean  and  more  weight  if  they  fall  on  the  same  side.  For 
the  case  of  l  bad  values,  similarly,  equation  (6.8) 
indicates  that  the  bad  values  will  have  less  weight  if  they 
are  equally  spread  on  both  sides  of  the  mean  and  more  weight 
if  more  of  them  are  concentrated  on  the  same  side. 

Example:  Darwin's  Data 

We  illustrate  how  observations  are  weighted  in  such  a  Bayesian 
analysis  using  Darwin's  data  concerning  fifteen  differences  of  the 
heights  of  cross-  and  self-fertilized  plants  quoted  by  Fisher 
(1960,  p  37).  Our  interest  here  will  be  to  examine  more  closely 
the  weighting  structure  of  the  Bayesian  posterior  mean,  and  to  fur¬ 
ther  develop  the  analysis  of  Box  and  Tiao  (1968). 


-31- 


The  data  consist  of  measu regents  on  15  pairs  of  plants,  each 
pair  contained  a  self-fertilized  and  a  cross-fertilized  plant 


grown 

in  the 

same 

pot.  The  15 

differences 

yi 

were  recorded  in 

Table 

2  in 

ascending  order. 

*1 

y2 

y3  y4 

y5 

y6 

y7 

y8 

-67 

-48 

6  8 

14 

16 

23 

24 

*9 

y10 

yll  y12 

y13 

y14 

y15 

28 

29 

41  49 

56 

60 

75 

Table  2  Darwin's  Data 


Inspection  of  the  data  indicates  that  two  observations  -67  and  -48 
are  remote  from  the  rest.  We  shall  assume  there  are  at  most  two 
bad  values  in  the  sample  and  k  =  5.  Then  the  posterior  mean  ! 

A  ^  . 

0  =  QqYq  +  +  Q2 .  Applying  these  results  to  Darwin's  data 

we  find  that  YQ  =  20.97. 

To  calculate  Yj ,  we  need  r.  -  the  standardized  residual, 
R-4>R. 

Rj.  R  and  -  the  weight  Y^  puts  on  the  i-th  observation. 

These  values  are  given  in  Table  4.3  resulting  in  the  value 

Y,  =  24.65  . 
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between  zero  and  one  and  dramatically  increases  when  |r^| 
becomes  larger.  For  this  sample  is  much  larger  than  the 

others  and  results  in  a  small  weight  for  y^. 

The  calculation  of  Y«  is  more  complicated  ;  we  need  rl.. 

Cm  I  J 

i  n 

H-,  T..,  T,  and  Y  T...  Table  4  gives  all  the  relevant 
ij  ij  ij 

J7i 

quantities.  Since,  by  symmetry  T. .  =  T-.,  We  use  the  upper 

I  J  J  • 
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•  -IT 

right  half  of  the  table  to  record  r..,  r . .  ,  T..  and  0.., 

ij  lj  lj  lj 

and  the  lower  left  half  to  record  (rV)2,  (r^.)2  ,  (r^  .  +  r'?-)2 

v  ij'  '  ij'  lj  lj' 

a"tl  fiio  =  '  -5^T((rU)!+(rU)J  •  n  (rij  +  Their 

corresponding  positions  are  explained  at  the  bottom  of  Table  4. 

n 

For  each  i,  £  T. .  is  given  in  the  last  row  and  T  is  shown 
j  =  l  1J 

at  the  lower  right  corner.  From  these  quantities,  one  can  then 

n 

j=l  1J 

calculate  "(n-2^)T~  ’  tfie  wei'9^t  applied  to  the  i-th  observation 


(Table  5) 

and 

Y2  =  30. 

,74. 

1 

1 

2 

3 

4 

5 

6 

7 

8 

*1 

-67 

-48 

6 

8 

14 

16 

23 

24 

n 

T-4>  l  Ti i 

j=l  1J 

js'i 

ijr-2Trr 

.007 

.017 

.075 

.075 

.075 

.075 

.075 

.075 

9 

10 

11 

12 

13 

14 

15 

*i 

28 

29 

41 

49 

56 

60 

75 

n 

T-*XTij 

J=1  J 

jJM 

(n-2fji 

.075 

.075 

.075 

.075 

.075 

.075 

.073 

T 

able 

5  The 

weight 

r.ppli,: 

:d  to  < 

each  observation 

in  Y, 

-33- 


We  again  noticed  that  T--  which  is  proportional  to  the 

*  J 

inverse  of  a  t  ordinate  at  ( rl  - ,  r?.)  is  not  sensitive  to  rl. 

'  J  *  J  *  J 

and  r^-  when  they  are  both  between  zero  and  one.  But  they 

increase  at  a  very  fast  rate  if  either  one  gets  large  (over  1.5, 
say),  and  even  faster  if  both  are  large.  In  Table  4 

is  larger  than  average  for  all  j  since  the  standardized  residuals 

(r15j)'s  are  1ar9er  than  1*5.  T2j's  are  even  larger  but 

still  not  as  dramatically  large  as  for  all  j.  And  the 

1  2 

dominating  term  is  really  where  both  r^  and  r^  are 

n  n 

large.  Under  such  circumstances,  T  T, .  and  7  T9.  are  much 

0=1  1J  j=l  ° 

0^1  0/2 

n 

larger  than  all  other  £  T..  since  they  are  the  only  two  which 

0=1  1J 

j/i 

Include  T,,  in  the  sunnation.  This  results  in  the  small  weights 


for  y,  and  y0  in  Table  5. 


So  far  we  have  VQ  =  20.93,  Y]  =  24.65,  Y£  =  30.74  and 
furthermore  we  can  calculate 


1+G  •  61.04  +  G2* 4720. 35 

_ 6  •  f.l .  04 _ 

1-SG  *  61.04  +  G2*  4720. 35 

.  (i2  •  4720,35  _ 

1-tG  •  61.04  G2’ 4720. 35 


r  _  a  1 
G  "  l^k 
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*- 


The  values  taken  by  Qq,  depend  on  R,  T  and  G.  The 

larger  G  and  T  are,  the  larger  Q2  will  be,  the  larger  R  is, 
the  larger  will  be.  Since  R  and  T  are  fixed  once  obser¬ 
vations  are  obtained,  and  k  is  fixed  at  5  for  this  example, 
the  only  thing  we  can  change  now  is  a  (or  G).  Table  6  gives 
Q0»  ,  Q2  and  posterior  mean  QqYo  +  Q^Yj  +  Q2?2  for  different 

a  (or  G)  values. 


a 

.01 

.05 

.10 

.20 

G 

.002 

.011 

.022 

.050 

«0 

.875 

.462 

.213 

.063 

0) 

.108 

.297 

.289 

.193 

.017 

.242 

.497 

.744 

+QiY,+Q,Y, 

21  .50 

24.41 

26.89 

28.95 

Table  6  Qq,  ,  Q2  and  posterior  means  for  different 
a  (or  G)  values. 

It  is  seen  that  different  choices  of  a,  and  hence  of  G, 
greatly  influence  the  value  0^  and  hence  change  the  emphasis 

on  Y0,  Yj,  and  Y2>  leading  in  this  example  to  different  results. 

When  there  are  no  bad  values  or  very  obvious  bad  values,  the 
results  are  not  particularly  sensitive  to  the  choice  of  a.  In 
practice,  it  may  be  worthwhile  to  actually  run  the  analysis  on  the 
computer  and  change  the  value  of  a  (or  G).  Estimates  which  are 
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very  sensitive  to  different  choices  of  a  (or  C)  indicate  strongly, 
as  in  this  example,  the  presence  of  bad  values. 

Empirical  Study  of  the  Depende ncy  of  the  Posterior  Kean  on  G 

Recall  we  concluded  that  G  is  the  only  important  parameter 
the  results  depend  on  when  k  is  large.  In  the  case  of  one  bad 
val  ue,  we  have  shown  that  this  is  generally  true  for  k  >m.  Tor 
the  general  case  of  £  bad  values,  we  believe  this  is  still  true 
but  we  probably  will  need  a  larger  k. 

Box  and  Tiao  listed  posterior  means  for  Darwin's  date  with 
different  choices  of  a  and  k.;  from  these  we  can  calculate  G 
corresponding  to  each  posterior  mean,  as  shown  in  Table  7. 


«k 

5 

6 

7 

8 

9 

10 

.01 

21.50 

21.45 

21.40 

21.35 

21.31 

21.27 

(.0020) 

(.0017) 

(.0014) 

(.0013) 

(.00111 

(.0010) 

.02 

22.21 

22.11 

22.00 

21.89 

21 .79 

21.71 

(.0041) 

(.0034) 

(_l0029) 

(.0026) 

(.0023) 

(.0020) 

.03 

22.97 

22.84 

22.67 

22.50 

22.35 

22.22 

(.0062) 

(.0052) 

(.0044) 

(.0039) 

(.0034) 

( .0031 ) 

.04 

23.71 

23.56 

23.36 

2  3. "14 

22.94 

22.76 

(.0083) 

(.0069) 

(.0059) 

(.0052) 

(.0046) 

( .com 

.05 

24.41 

24.26 

24.03 

23.73 

23.54 

O  -a  ^  o 

c o . 04 

(.0105) 

(.0083) 

(.0075) 

(.0056) 

( . 005S) 

0053) 

.06 

25.03 

24.90 

24.65 

24.39 

24".  1 3 

23.87" 

_ (_.0128) 

(.0106) 

0091  ) 

0080) 

(.0071  ) 

(.0064) 

.07 

25.59 

25.47 

25.24 

24 . 96 

24.63 

24 . 41 

(.0151) 

(.0125)  (.0103) 

J_.  0094 ) 

(.0034) 

(.0075) 

.08 

26.08 

25.99 

25.77 

25.49 

25.21 

24.92 

(.0174)  (.0145) 

(.0124) 

(.oioo) 

J.0097) 

(.0037) 

.09 

26.51 

26.45 

26.25 

25.93 

25 . 69 

25 . 40 

(.0198)  (.0165) 

(.0141) 

(.0110)  (.'.099) 

Ao 

26.89 

26.  S6 

26.67 

26.42 

2b.  13 

25.85 

(.0222) 

(.0135) 

1-015") 

(.0139)  (.0123) 

l-oni) 

Table  7  The  posterior  mean  for  Darwin's  data  with 
different  a  and  k  and  the  corresponding 
G  given  in  the  bracket  under  the  posterior  mean. 


It  is  clear  from  the  table  that  for  fixed  G,  the  posterior 


means  are  in  general  not  too  different,  for  example, 


G  = 

.0111 

(a  =  .10, 

k  =  10.0) 

25.85 

G  = 

.0110 

(a  =  .09, 

k  =  9.0) 

25.69 

G  = 

.0109 

(a  =  .03, 

k  =  8.0) 

25.49 

G  = 

.0108 

(a  =  .07, 

k  =  7.0) 

25.24 

G  = 

.0106 

(a  =  .06, 

k  =  6.0) 

24.90 

G  = 

.0105 

(a  =  .05, 

k  =  5.0) 

24.41 

But  there  is  a  systematic  change  as  k  decreases  indicating  the 
dependency  on  k.  It  is  also  seen  that  such  a  change  is  smaller 
when  k  is  larger.  For  this  example,  it  seems  that  we  need  to 
have  k  at  least  as  large  as  7  for  the  results  to  be  essential! 
dependent  on  G. 

In  general,  this  property  would  depend  on  the  ratio 

•  The  larger  this  ratio  is,  the  larger  k 
number  of  ooservations  3  3 

must  be  for  G  to  be  the  dominant  parameter. 


The  2  Factorial  Desiqn 


It  is  much  more  difficult  to  obtain  results  in  terms  of 

weighting  as  soon  as  we  deal  with  more  complicated  designs.  As  a 

2 

simple  example,  consider  the  2  factorial  design  with  the  linear 
model 

Y  «  XO  +  e 

where 
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"l 

-1 

-f 

/yl 

X  = 

1 

+1 

-1 

Y  = 

fy2 

1 

-1 

+1 

•• 

Iy3 

1 

+1 

+1 

U/ 

■^1 


If  v/e  assume  there  is  at  most  one  bad  value,  we  can  obtain  the 
posterior  mean  as  follows: 


Let 


p  '  * 


8(0)  *  posterior  mean  given  there  are  no  bad  values 


1 

4 


'y,  +y2  +y3  +y4' 
-y,  +y2  -y3  +y4 
-y,  -y2  +y3  +y4 


(i) 


the  conditional  posterior  mean  given  that  y 
is  a  bad  value 


1 


1 


6p+2 


“2py1  +(p+i)y2  +(p+i)y3  +  2py4~ 

-2py1  +  2py2  -{p+1 )y3  +  (p+l)y4 
-2py1  -(p+1 )y2  +  2py3  +  (p+1 )y4 


8(2)  =  the  conditional  posterior  mean  given  that  y2 
Is  a  bad  value 


6p+2 


2py3  +(p+l)y4 


” (p+1 )y!  +2py2  + 

-  2py1  +2py2  -(p+1 )y3  +(p+l)y4 
-(p+l)y1  -2py2  +(p+l)y3  +  2py4 
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=  the  conditional  posterior  mean  given  that  y3 
is  a  bad  value 

"(p+l)y1  +  2py2  +2py3  +(p+l)y4 

-(p+i )y1  +(p+i)y2  -2py3  +  2py4 

-  2py]  -(p+l)y2  +2py3  +(p+l)y4 

0^4j  *  the  conditional  posterior  mean  given  that  y4 
is  a  bad  value 

2py1  +(p+l)y2  +(p+l)y3  +2py" 
-(P+1)y1  +(P+l)y2  ~  2py3  +2py4 
-(p+l)y1  -  2py2  +(p+l)y3  +2py4 

and  from  (5.10)  we  have 


1 

6p+2 


1 

6p+2 


=  the  posterior  probability  that  the  i-tn  observation 
is  a  bad  value 


=  c 


a  1 

l-o  k  fA 


8 


(4-<J>)  -3(4-4>)-2 


{1  +  $rl(l-  f»r.) 


1 


o  1 
1-a  k 


‘  (4-<J>) 3-3(4-$)-2 


(1  +  <f>0- 


1 

2 


The  quantity  wQ  is  the  posterior  probability  that  there  are  no 
bad  values  and  is  equal  to  c.  Then, 


4  * 

the  posterior  mean  =  J  wiS(i)  * 

To  see  how  this  posterior  mean  weights  each  observation,  we  write 
the  posterior  mean  as 
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»i' °  °  0  1  ->  V  i"i  i  i  ii  vi  ° 0  0  M 

-1  +1  -1  +I  0  V  0  1+1  -1  +1  -1  +1  0  v2  °,°  H 

-1  -1  +1  +1  00  v3  °,  1  +1  -1  -1  +1  +1  00  v3  O.lj^l 

j  0  0  0  v4 j[l  +1  +1J/  L  JL°  0  0v4j|j,4j 


then  vT^  is  the  weight  applied  to  the  i-th  observation  in  this 
Bayesian  procedure. 

By  equating  the  above  formula  with  the  posterior  mean,  we 
can  solve  for  v^  and  we  have 

-1  1 _ 

V1  T4-4w1-w0)p+4w1+w“ 


-1  _ 1 _ __ 

2  (4-4w9-w  )p+4w?+w 


£.  \-T  "q/K  Tl<2  0 

V-1  cc  ^ _ l _ 

3  T4-4w3-wo;p+4w3+wo 


V-1  «  _ 1 _ 

4  (4-4w4-wo)p+4w4+wo 


Since  p  =  —  which  is  usually  small,  we  can  approximate  p  by  zero 

k2 

and  have  approximately 

-1  1 _  -1  1 

V1  “  4wy  wo  v3  “  '4w3+wq 

V-1  «  1 _  V-1  tt_J _ 

2  4w2+w0  4  4w4+wq 

The  results  are  not  surprising  because  we  would  expect  that  the 
weight  given  to  the  i-th  observation  would  be  small,  if  the  posterior 
probability  indicates  that  the  i-th  observation  is  bad. 


For  the  general  linear  model  where  we  do  not  have  orthogo¬ 
nality,  no  general  interpretative  results  of  the  kind  possible  for 
simple  cases  occur.  However,  of  course,  the  Bayesian  procedure  des¬ 
cribed  by  Box  and  Tiao  (19G3)  can  still  be  carried  through. 

8.  Summary 

The  Bayesian  procedure  proposed  by  Box  and  Tiao  (1968) has  been 
studied  in  further  detail.  We  conclude  that 

(i)  The  posterior  probability  that  a  particular  set  of  obser¬ 
vations  is  discordant  depends  on  the  inverse  of  the 
multivariate-t  ordinate  of  the  standardized  residuals 
corresponding  to  those  observations. 

(ii)  In  the  location  case,  the  posterior  mean  can  be  written 
as  QqY0  +  +...+  Q^Yf  if  we  assume  at  most  Jt  bad 

values  present.  The  quantity  Y.  is  the  posterior  mean 

given  that  there  are  i  bad  values  and  does  not  involve 
a.  The  quantity  is  the  posterior  probability  of 

i  bad  values  which  involves  both  a  and  k. 

(iii)  If  k  is  large  enough  so  that  <J>  is  approximately  nne, 

Y.  does  not  depend  heavily  on  k  ,  and  depends 

Cl  1 

almost  exclusively  on  G  =  y  .  Thus,  as  an  approxi¬ 
mation,  one  can  talk  about  this  procedure  in  terms  of  one 
para:.. 'ter  G  only.  For  this  to  be  true  in  the  location 


case  where  there  is  only  one  bad  value  in  a  sample  of  size 
10,  simulation  shows  that  k  must  be  larger  than  five.  As 
the  number  of  bad  values  becomes  larger  relative  to  the  sample 
size,  a  larger  k  value  seems  to  be  needed. 

(iv)  Bad  values  are  given  less  weight  if  they  are  more  evenly  spread 
on  both  sides  of  the  location  parameter  and  more  weight  if 
they  are  more  concentrated  on  either  one  side. 

p 

(v)  For  a  2  factorial  design,  it  is  shown  one  can  also  write 
the  posterior  mean  in  a  weighted  least  square  form.  In 
this  case,  the  weight  for  the  i-th  observation  is 

approximately  proportional  to  • 
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